Good Bigrams

Author

  • Christer Johansson
Abstract

A desired property of a measure of connective strength in bigrams is that the measure should be insensitive to corpus size. This paper investigates the stability of three different measures over text genres and expansion of the corpus. The measures are (1) the commonly used mutual information, (2) the difference in mutual information, and (3) raw occurrence. Mutual information is further compared to using knowledge about genres to remove overlap between genres. This last approach considers the difference between two products of the same process (human text-generation) constrained by different genres. The cancellation of overlap seems to provide the most specific word pairs for each genre.
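As a rough illustration of two of the measures named in the abstract, the sketch below computes raw occurrence and the standard pointwise mutual information, log2(P(x,y) / (P(x) P(y))), for adjacent word pairs. The function name `bigram_scores` and the probability estimates are assumptions made for this example; the paper's exact formulation, and its "difference in mutual information" measure, may differ.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Map each adjacent pair (w1, w2) to (raw_count, pmi).

    Hypothetical helper: probabilities are plain relative frequencies,
    which is an assumption, not necessarily the paper's estimator.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # relative frequency of the pair
        p_x = unigrams[w1] / n_uni   # relative frequency of the first word
        p_y = unigrams[w2] / n_uni   # relative frequency of the second word
        scores[(w1, w2)] = (count, math.log2(p_xy / (p_x * p_y)))
    return scores


if __name__ == "__main__":
    text = "the cat sat on the mat and the cat slept".split()
    ranked = sorted(bigram_scores(text).items(), key=lambda kv: -kv[1][1])
    for pair, (count, pmi) in ranked:
        print(pair, count, round(pmi, 3))
```

Running the same function on corpora of increasing size and comparing the rankings is one simple way to probe the size sensitivity the abstract refers to; pointwise mutual information is known to assign high scores to pairs built from rare words, so its rankings can shift as counts grow.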

Similar resources

Using the Web to Obtain Frequencies for Unseen Bigrams

This paper shows that the web can be employed to obtain frequencies for bigrams that are unseen in a given corpus. We describe a method for retrieving counts for adjective-noun, noun-noun, and verb-object bigrams from the web by querying a search engine. We evaluate this method by demonstrating: (a) a high correlation between web frequencies and corpus frequencies; (b) a reliable correlation bet...

Terms Derived from Frequent Sequences for Extractive Text Summarization

Automatic text summarization helps the user to quickly understand large volumes of information. We present a language- and domain-independent statistical-based method for single-document extractive summarization, i.e., to produce a text summary by extracting some sentences from the given text. We show experimentally that words that are parts of bigrams that repeat more than once in the text are g...

The use of bigrams to enhance text categorization

In this paper, we present an efficient text categorization algorithm that generates bigrams selectively by looking for ones that have an especially good chance of being useful. The algorithm uses the information gain metric, combined with various frequency thresholds. The bigrams, along with unigrams, are then given as features to two different classifiers: Naïve Bayes and maximum entropy. The ...

Viterbi beam search with layered bigrams

We outline an implementation of Viterbi beam search that incorporates layered bigrams. Layered bigrams are class bigrams in which some nodes are themselves bigrams, resulting in a recursive structure. The implementation is in C++ and involves a hierarchy of classes. The paper outlines the main concepts and the corresponding C++ classes.

Word Clustering using Long Distance Bigram Language Models

Two novel word clustering techniques employing language models of long distance bigrams are proposed. The first technique is built on a hierarchical clustering algorithm and minimizes the sum of Mahalanobis distances of all words after cluster merger from the centroid of the resulting class. The second technique resorts to the probabilistic latent semantic analysis (PLSA). Interpolated versions...

Retrieving Collocations From Korean Text

This paper describes a statistical methodology for automatically retrieving collocations from POS-tagged Korean text using interrupted bigrams. The free word order of Korean makes it hard to identify collocations. We devised four statistics, 'frequency', 'randomness', 'condensation', and 'correlation', to account for the more flexible word order properties of Korean collocations. We extracted meanin...

Publication date: 1996